Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild
Despite recent achievements, existing 3D interacting hands recovery methods
have shown results mainly on motion capture (MoCap) environments, not on
in-the-wild (ITW) ones. This is because collecting 3D interacting hands data in
the wild is extremely challenging, even for 2D data. We present InterWild,
which brings MoCap and ITW samples to shared domains for robust 3D interacting
hands recovery in the wild with a limited amount of ITW 2D/3D interacting hands
data. 3D interacting hands recovery consists of two sub-problems: 1) 3D
recovery of each hand and 2) 3D relative translation recovery between two
hands. For the first sub-problem, we bring MoCap and ITW samples to a shared 2D
scale space. Although ITW datasets provide a limited amount of 2D/3D
interacting hands, they contain large-scale 2D single hand data. Motivated by
this, we use a single hand image as an input for the first sub-problem
regardless of whether two hands are interacting. Hence, interacting hands of
MoCap datasets are brought to the 2D scale space of single hands of ITW
datasets. For the second sub-problem, we bring MoCap and ITW samples to a
shared appearance-invariant space. Unlike the first sub-problem, 2D labels of
ITW datasets are not helpful for the second sub-problem due to the 3D
translation's ambiguity. Hence, instead of relying on ITW samples, we amplify
the generalizability of MoCap samples by taking only a geometric feature
without an image as an input for the second sub-problem. As the geometric
feature is invariant to appearances, MoCap and ITW samples do not suffer from a
huge appearance gap between the two datasets. The code is publicly available at
https://github.com/facebookresearch/InterWild.
Comment: Published at CVPR 202
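The idea of a shared 2D scale space for the first sub-problem can be illustrated with a small sketch: each hand is cropped on its own and resampled to a fixed resolution, so a hand taken from an interacting-hands MoCap frame ends up at the same pixel scale as a single-hand ITW crop. This is only an illustration under stated assumptions, not the InterWild implementation; the function names, the `(x, y, w, h)` box format, and the 256-pixel crop size are hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, w, h) in image pixels (assumed format)
Point2D = Tuple[float, float]


def square_box(bbox: BBox) -> Tuple[float, float, float]:
    """Expand a hand box to a square around its center; return (x0, y0, side)."""
    x, y, w, h = bbox
    side = max(w, h)
    cx, cy = x + w / 2, y + h / 2
    return cx - side / 2, cy - side / 2, side


def to_crop_coords(points: List[Point2D], bbox: BBox, out_size: int = 256) -> List[Point2D]:
    """Map image-space 2D points into the fixed-size crop of a single hand."""
    x0, y0, side = square_box(bbox)
    return [((px - x0) * out_size / side, (py - y0) * out_size / side)
            for px, py in points]


# A small hand and a large hand both land at the crop center after normalization.
small = to_crop_coords([(120.0, 60.0)], (100.0, 50.0, 40.0, 20.0))
large = to_crop_coords([(340.0, 240.0)], (300.0, 200.0, 80.0, 80.0))
print(small, large)
```

Because every hand is normalized by its own box, the network input no longer depends on whether the original frame contained one hand or two, which is the property the abstract relies on.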
V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map
Most of the existing deep learning-based methods for 3D hand and human pose
estimation from a single depth map are based on a common framework that takes a
2D depth map and directly regresses the 3D coordinates of keypoints, such as
hand or human body joints, via 2D convolutional neural networks (CNNs). The
first weakness of this approach is the presence of perspective distortion in
the 2D depth map. While the depth map is intrinsically 3D data, many previous
methods treat depth maps as 2D images, which can distort the shape of the actual
object through projection from 3D to 2D space. This compels the network to
perform perspective distortion-invariant estimation. The second weakness of the
conventional approach is that directly regressing 3D coordinates from a 2D
image is a highly non-linear mapping, which causes difficulty in the learning
procedure. To overcome these weaknesses, we first cast the 3D hand and human
pose estimation problem from a single depth map into a voxel-to-voxel
prediction that uses a 3D voxelized grid and estimates the per-voxel likelihood
for each keypoint. We design our model as a 3D CNN that provides accurate
estimates while running in real-time. Our system outperforms previous methods
on almost all publicly available 3D hand and human pose estimation datasets and
placed first in the HANDS 2017 frame-based 3D hand pose estimation challenge.
The code is available at https://github.com/mks0601/V2V-PoseNet_RELEASE.
Comment: HANDS 2017 Challenge Frame-based 3D Hand Pose Estimation Winner (ICCV
2017), Published at CVPR 201
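The voxel-to-voxel formulation described in the abstract can be sketched in three steps: back-project the depth map to 3D points, mark the occupied cells of a voxel grid, and read a keypoint off a per-voxel likelihood volume. The sketch below is a minimal illustration in plain Python, not the V2V-PoseNet code. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the cube size, the grid resolution, and the argmax read-out are assumptions made for the example, and the 3D CNN that would produce the likelihood volume is omitted.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Grid = List[List[List[float]]]


def backproject(depth: Sequence[Sequence[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point3D]:
    """Pinhole back-projection; a depth of 0 is treated as 'no reading'."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def voxelize(points: Sequence[Point3D], center: Point3D,
             cube_size: float, grid: int) -> Grid:
    """Binary occupancy grid with grid**3 voxels covering a cube around center."""
    voxel = cube_size / grid
    origin = [c - cube_size / 2 for c in center]
    occ = [[[0.0] * grid for _ in range(grid)] for _ in range(grid)]
    for p in points:
        idx = [int((p[a] - origin[a]) // voxel) for a in range(3)]
        if all(0 <= i < grid for i in idx):
            occ[idx[0]][idx[1]][idx[2]] = 1.0
    return occ


def keypoint_from_likelihood(likelihood: Grid, center: Point3D,
                             cube_size: float) -> Point3D:
    """Return the 3D center of the voxel with the highest likelihood."""
    grid = len(likelihood)
    voxel = cube_size / grid
    cells = ((i, j, k) for i in range(grid) for j in range(grid) for k in range(grid))
    best = max(cells, key=lambda t: likelihood[t[0]][t[1]][t[2]])
    return tuple(center[a] - cube_size / 2 + (best[a] + 0.5) * voxel
                 for a in range(3))
```

Estimating a likelihood per voxel and then converting the best voxel back to metric coordinates avoids regressing 3D coordinates directly from a 2D image, which is the highly non-linear mapping the abstract identifies as the second weakness.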